Sleep is something that has always been a challenge for me. I
struggle to fall asleep and am what some people would call a “night
owl”. Because of this, I have always been fascinated by learning more
about how to have quality sleep. Sleep is essential for all humans.
According to the Sleep Foundation (2024), getting quality sleep can help
you to concentrate, manage the effects of stress, make decisions,
problem solve, heal your body, and fight infections and diseases. These
are all important things that we need in our lives.
For this report, I wanted to see whether bedtime has any bearing on
the quality of a person’s sleep and, if not, which factors are
indicators of better sleep.
I found a dataset, “Sleep_Efficiency.csv”, on kaggle.com. I cleaned the column names using the clean_names() function from the janitor package. The raw dataset contained a “bedtime” column which included dates and times. I only wanted the time, so I separated that out. When I first tried plotting the bedtime data, everything was on the far right and left sides of the graph with nothing in the middle, because none of the subjects’ bedtimes were during the day. What I wanted was for the times to wrap from the evening into the early morning so that they form one continuous range. To do this, I added 24 hours to any time earlier than 6:00 p.m.
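The wrapping step can be sketched in base R as follows. This is a minimal sketch, not the exact cleaning code: the raw values and their date-time format are hypothetical, and storing the time as an HHMM number is an assumption based on how bedtime appears in the cleaned data.

```r
# Hypothetical raw values; the real column's exact format is an assumption.
raw_bedtime <- c("2021-03-06 01:00:00", "2021-12-05 02:00:00", "2021-05-25 21:30:00")

# Parse the date-times and keep only the clock time as an HHMM number.
parsed <- as.POSIXlt(raw_bedtime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
hhmm <- parsed$hour * 100 + parsed$min

# Times before 6:00 p.m. (1800) are early-morning bedtimes, so shift them by 24 hours.
wrapped <- ifelse(hhmm < 1800, hhmm + 2400, hhmm)
wrapped
```

With this encoding, 1:00 a.m. becomes 2500 and sorts after 9:30 p.m. (2130), which is what lets the evening and early-morning bedtimes sit next to each other on one axis.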
The dataset included different sleep types (Deep, Light, and REM) and the percentage of time the subjects spent in each stage. These were stored in separate columns, so I used pivot_longer() to turn them into “sleep type” and “sleep type percentage” columns. Once I was done cleaning the dataset, I saved the clean version of the data into a new csv file, which is what I used in this report. After loading this data again, I needed to convert some columns to the data types that I wanted: I changed gender, smoking status, and sleep type into factors.
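As an illustration of the reshaping step, here is a minimal sketch using tidyr on toy data. The three wide column names (rem_sleep_percentage, deep_sleep_percentage, light_sleep_percentage) are assumptions about what the cleaned column names looked like, and the values are only examples.

```r
library(tidyr)

# Toy data in the wide layout; the three percentage column names are assumed.
wide <- data.frame(
  id = c(1, 2),
  rem_sleep_percentage = c(18, 19),
  deep_sleep_percentage = c(70, 28),
  light_sleep_percentage = c(12, 53)
)

# One row per subject and sleep type instead of one column per sleep type.
long <- pivot_longer(
  wide,
  cols = ends_with("_sleep_percentage"),
  names_to = "sleep_type",
  names_pattern = "(.*)_sleep_percentage",
  values_to = "sleep_type_percentage"
)

# Convert the new label column to a factor with readable level names.
long$sleep_type <- factor(long$sleep_type,
                          levels = c("rem", "deep", "light"),
                          labels = c("REM", "Deep", "Light"))
long
```

Note that this layout repeats every subject three times, once per sleep type, which is why the analysis later filters to a single sleep type before fitting a model.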
Let’s take a quick look at the cleaned dataset:
## Rows: 1,356
## Columns: 13
## $ id <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, …
## $ age <dbl> 65, 65, 65, 69, 69, 69, 40, 40, 40, 40, 40, 40, …
## $ gender <fct> Female, Female, Female, Male, Male, Male, Female…
## $ sleep_duration <dbl> 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 6.0…
## $ sleep_efficiency <dbl> 0.88, 0.88, 0.88, 0.66, 0.66, 0.66, 0.89, 0.89, …
## $ awakenings <dbl> 0, 0, 0, 3, 3, 3, 1, 1, 1, 3, 3, 3, 3, 3, 3, 0, …
## $ caffeine_consumption <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 50, 50, 50, 0, 0, 0, …
## $ alcohol_consumption <dbl> 0, 0, 0, 3, 3, 3, 0, 0, 0, 5, 5, 5, 3, 3, 3, 0, …
## $ smoking_status <fct> Yes, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, Y…
## $ exercise_frequency <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 3, 1, …
## $ sleep_type <fct> REM, Deep, Light, REM, Deep, Light, REM, Deep, L…
## $ sleep_type_percentage <dbl> 18, 70, 12, 19, 28, 53, 20, 70, 10, 23, 25, 52, …
## $ bedtime <dbl> 2500, 2500, 2500, 2600, 2600, 2600, 2130, 2130, …
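One thing to keep in mind about the bedtime column: values like 2130 and 2500 look like clock times written as HHMM numbers (this is inferred from the output above). If that is right, the numeric spacing is uneven: 2130 to 2200 is a step of 70, while 2200 to 2230 is a step of 30, even though both are 30 minutes. A minimal sketch of converting to decimal hours, which might be a better fit when bedtime is used as a numeric predictor:

```r
# Example values taken from the glimpse output above.
bedtime_hhmm <- c(2500, 2600, 2130)

# Integer division gives the hours; the remainder gives the minutes.
bedtime_hours <- bedtime_hhmm %/% 100 + (bedtime_hhmm %% 100) / 60
bedtime_hours
```

On this scale 21.5 means 9:30 p.m. and 25 means 1:00 a.m., so equal differences in the number correspond to equal amounts of time.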
I was interested to see if bedtime would have any effect on sleep efficiency.
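# Packages used by the code in this report (assumed from the functions called)
library(tidyverse)   # ggplot2, dplyr, tidyr
library(plotly)      # ggplotly()
library(broom)       # tidy()
library(patchwork)   # combining plots with p1 + p2
library(performance) # compare_performance()
library(modelr)      # add_predictions()
# Plot of bedtime effect on sleep efficiency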
bed_plot1 <- df %>%
ggplot(aes(x=factor(bedtime), y=sleep_efficiency, color=factor(bedtime))) +
geom_point() +
scale_x_discrete(labels=c("9:00 p.m.", "9:30 p.m.", "10:00 p.m.", "10:30 p.m.", "11:00 p.m.", "12:00 a.m.",
"12:30 a.m.", "1:00 a.m.", "1:30 a.m.", "2:00 a.m.", "2:30 a.m.")) +
labs(x="Bedtime", y="Sleep Efficiency Proportion", title="Scatterplot of Bedtime's Effect on Sleep Efficiency") +
theme_minimal() +
theme(axis.text.x = element_text(angle=25),
legend.position = "none")
# Interactive plot
ggplotly(bed_plot1)
I was surprised that bedtime does not appear to have any effect on sleep efficiency.
I fit a glm model using all of the available predictors to see what did have an effect on sleep efficiency. I first filtered the data to the REM rows, so that the sleep type percentage column holds the REM sleep percentage.
rem_df <- df %>%
filter(sleep_type == "REM")
mod1REM <- glm(data=rem_df, formula = sleep_efficiency ~ age + gender +
sleep_duration + awakenings + caffeine_consumption +
alcohol_consumption + smoking_status + exercise_frequency + sleep_type_percentage + bedtime)
tidy(mod1REM) %>%
kableExtra::kable() %>%
kableExtra::kable_classic(lightable_options = 'hover')
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.8653406 | 0.0911912 | 9.4892980 | 0.0000000 |
| age | 0.0013841 | 0.0003856 | 3.5899924 | 0.0003744 |
| genderMale | 0.0082960 | 0.0108749 | 0.7628562 | 0.4460262 |
| sleep_duration | -0.0023318 | 0.0055513 | -0.4200353 | 0.6746989 |
| awakenings | -0.0475231 | 0.0038185 | -12.4454136 | 0.0000000 |
| caffeine_consumption | 0.0001605 | 0.0001775 | 0.9044667 | 0.3663256 |
| alcohol_consumption | -0.0237291 | 0.0031003 | -7.6537166 | 0.0000000 |
| smoking_statusYes | -0.0781570 | 0.0106176 | -7.3610714 | 0.0000000 |
| exercise_frequency | 0.0117639 | 0.0037580 | 3.1303375 | 0.0018822 |
| sleep_type_percentage | 0.0016225 | 0.0014384 | 1.1279679 | 0.2600508 |
| bedtime | -0.0000210 | 0.0000322 | -0.6519992 | 0.5147990 |
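A side note on the model above: when glm() is called without a family argument it defaults to the gaussian family with an identity link, which gives the same coefficients as lm(). The sketch below checks this on the built-in mtcars data, which is a stand-in and not the sleep data.

```r
# Same formula fit two ways on a built-in dataset (stand-in for the sleep data).
fit_glm <- glm(mpg ~ wt + hp, data = mtcars)  # family defaults to gaussian
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)

# The coefficient estimates agree up to numerical tolerance.
all.equal(coef(fit_glm), coef(fit_lm))
```

This is why the REM model fit with glm() and the deep sleep model fit with lm() can be read in the same way.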
The REM sleep percentage was not a significant predictor of sleep efficiency. Next, I will see if the deep sleep percentage is.
deep_df <- df %>%
filter(sleep_type == "Deep") # Time in deep sleep significant. REM is not
mod1 <- lm(data=deep_df, formula = sleep_efficiency ~ age + gender +
sleep_duration + awakenings + caffeine_consumption +
alcohol_consumption + smoking_status + exercise_frequency + sleep_type_percentage + bedtime)
tidy(mod1) %>%
kableExtra::kable() %>%
kableExtra::kable_classic(lightable_options = 'hover')
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 0.5812434 | 0.0582704 | 9.9749417 | 0.0000000 |
| age | 0.0011359 | 0.0002612 | 4.3491404 | 0.0000176 |
| genderMale | -0.0033367 | 0.0073479 | -0.4540953 | 0.6500216 |
| sleep_duration | 0.0017794 | 0.0037713 | 0.4718332 | 0.6373188 |
| awakenings | -0.0323604 | 0.0026860 | -12.0479598 | 0.0000000 |
| caffeine_consumption | 0.0003140 | 0.0001200 | 2.6161485 | 0.0092502 |
| alcohol_consumption | -0.0078486 | 0.0022348 | -3.5120300 | 0.0004986 |
| smoking_statusYes | -0.0436934 | 0.0073560 | -5.9398190 | 0.0000000 |
| exercise_frequency | 0.0069996 | 0.0025578 | 2.7365283 | 0.0065029 |
| sleep_type_percentage | 0.0051885 | 0.0002460 | 21.0909567 | 0.0000000 |
| bedtime | -0.0000284 | 0.0000218 | -1.3066559 | 0.1921261 |
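The deep sleep percentage was significant, along with several other variables. The variables which were significant (p < 0.05) are age, awakenings, caffeine consumption, alcohol consumption, smoking status, exercise frequency, and deep sleep percentage.
A few results surprised me: bedtime, of course, but also that gender and sleep duration were not significant. I would have thought that the longer you slept, the better your sleep quality would be.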
We will look at some of these relationships.
# Age
p1 <- deep_df %>%
ggplot(aes(x=age, y = sleep_efficiency, color=factor(age))) +
geom_jitter() +
geom_smooth(color="steelblue", se=FALSE, method="lm") +
theme_minimal() +
labs(title="Age",
x="Age",
y="Sleep Efficiency") +
theme(legend.position = "none")
# Awakenings
p2 <- deep_df %>%
filter(!is.na(awakenings)) %>%
ggplot(aes(x=awakenings, y = sleep_efficiency, color=factor(awakenings))) +
geom_jitter() +
geom_smooth(color="steelblue", method="lm", se=FALSE) +
theme_minimal() +
labs(title="Awakenings",
x="Awakenings",
y="Sleep Efficiency") +
theme(legend.position = "none")
p1 + p2
# Smoking
p5 <- deep_df %>%
ggplot(aes(x=factor(smoking_status), y = sleep_efficiency, fill=factor(smoking_status))) +
labs(title="Smoking Status",
x="Smoking Status",
y="Sleep Efficiency") +
geom_violin() +
theme_minimal() +
scale_fill_brewer(palette = "Paired") +
theme(legend.position = "none")
# Exercise
p6 <- deep_df %>%
filter(!is.na(exercise_frequency)) %>%
ggplot(aes(x=exercise_frequency, y = sleep_efficiency, color=factor(exercise_frequency))) +
geom_jitter() +
geom_smooth(color="steelblue", method="lm", se=FALSE) +
theme_minimal() +
labs(title="Exercise Frequency",
x="Exercise Frequency (weekly)",
y="Sleep Efficiency") +
theme(legend.position = "none")
p5 + p6
# Deep Sleep
deep_df %>%
filter(!is.na(awakenings)) %>%
ggplot(aes(x=sleep_type_percentage, y = sleep_efficiency, color=factor(awakenings))) +
geom_point() +
geom_smooth(se=FALSE, method="lm") +
theme_minimal() +
facet_wrap(~factor(awakenings))
Let’s make some models to see if we can predict sleep efficiency.
mod1 <- lm(data=deep_df, formula = sleep_efficiency ~ bedtime)
mod2 <- lm(data=deep_df, formula = sleep_efficiency ~
age + awakenings + sleep_type_percentage)
mod3 <- lm(data=deep_df, formula = sleep_efficiency ~
age * awakenings * sleep_type_percentage)
mod4 <- lm(data=deep_df, formula = sleep_efficiency ~
age + awakenings + alcohol_consumption + smoking_status +
sleep_type_percentage)
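The difference between mod2 and mod3 is only the operator in the formula: + adds main effects, while * adds the main effects plus every interaction between them. A small sketch that lists the terms each formula expands to (no data is needed for this):

```r
# The same three predictors, combined additively and fully crossed.
additive <- sleep_efficiency ~ age + awakenings + sleep_type_percentage
crossed  <- sleep_efficiency ~ age * awakenings * sleep_type_percentage

# term.labels shows the terms R will estimate (excluding the intercept).
additive_terms <- attr(terms(additive), "term.labels")
crossed_terms  <- attr(terms(crossed), "term.labels")

additive_terms
crossed_terms
```

The crossed formula has seven terms instead of three: the three main effects, three two-way interactions, and one three-way interaction. The extra terms give the model more flexibility, which is worth remembering when comparing fit statistics across models.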
We will look at the mean squared error for each of these models.
Mod1
mean(mod1$residuals^2)
## [1] 0.01786273
Mod2
mean(mod2$residuals^2)
## [1] 0.004803478
Mod3
mean(mod3$residuals^2)
## [1] 0.004494944
Mod4
mean(mod4$residuals^2)
## [1] 0.004279514
We will compare the models to see which is the best.
compare_performance(mod1, mod2, mod3, mod4)
## When comparing models, please note that probably not all models were fit
## from same data.
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
## ------------------------------------------------------------------------------------------------------
## mod1 | lm | -530.6 (<.001) | -530.5 (<.001) | -518.3 (<.001) | 0.021 | 0.019 | 0.134 | 0.134
## mod2 | lm | -1070.2 (<.001) | -1070.1 (<.001) | -1049.9 (0.083) | 0.739 | 0.737 | 0.069 | 0.070
## mod3 | lm | -1090.9 (0.997) | -1090.5 (0.996) | -1054.3 (0.752) | 0.756 | 0.752 | 0.067 | 0.068
## mod4 | lm | -1079.5 (0.003) | -1079.2 (0.004) | -1051.3 (0.165) | 0.768 | 0.765 | 0.065 | 0.066
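The RMSE column in this table is the square root of the mean squared errors computed earlier, so the two sets of numbers tell the same story on different scales. A quick check using the values printed above:

```r
# Mean squared errors copied from the output earlier in this report.
mse <- c(mod1 = 0.01786273, mod2 = 0.004803478,
         mod3 = 0.004494944, mod4 = 0.004279514)

# Taking the square root puts the error back on the sleep efficiency scale.
rmse <- round(sqrt(mse), 3)
rmse
```

Because RMSE is in the same units as sleep efficiency, a value of about 0.067 means the typical miss is roughly 7 percentage points of efficiency.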
compare_performance(mod1, mod2, mod3, mod4) %>%
plot()
## When comparing models, please note that probably not all models were fit
## from same data.
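The warning above usually appears when models drop different rows because of missing values, which makes AIC and BIC hard to compare. The filter(!is.na(...)) calls earlier suggest that some columns here contain missing values, though that is an assumption. One way to handle it is to fit every model to the same complete rows. A minimal sketch on the built-in airquality data, which is a stand-in and not the sleep data:

```r
# airquality has missing values in Ozone and Solar.R, so lm() silently
# drops a different number of rows depending on the predictors used.
m_small <- lm(Ozone ~ Temp, data = airquality)
m_large <- lm(Ozone ~ Temp + Solar.R, data = airquality)
c(nobs(m_small), nobs(m_large))  # the row counts differ

# Keep only rows that are complete for every variable used in any model.
vars <- c("Ozone", "Temp", "Solar.R")
aq_complete <- airquality[complete.cases(airquality[, vars]), ]

m_small_cc <- lm(Ozone ~ Temp, data = aq_complete)
m_large_cc <- lm(Ozone ~ Temp + Solar.R, data = aq_complete)
c(nobs(m_small_cc), nobs(m_large_cc))  # now the row counts match

# With identical rows, the AIC values are directly comparable.
AIC(m_small_cc, m_large_cc)
```

For the sleep models, the same idea would mean dropping rows with missing values in any predictor used by mod1 through mod4 before fitting them.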
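From these results, it appears that Mod3 is the best: it has the lowest AIC, AICc, and BIC. Mod4 has a slightly higher R2 and a slightly lower RMSE, but the warning above indicates that the models may not all have been fit to the same rows, so the comparison should be read with some caution.
I made predictions with hypothetical data.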
# Add predictions
df2 <- add_predictions(deep_df, mod3)
# Make hypothetical values for the independent variables
newdf <- data.frame(age = c(59, 32, 13, 43),
awakenings = c(3, 2, 1, 0),
sleep_type_percentage = c(49, 18, 45, 77))
# Make predictions
pred <- predict(mod3, newdata=newdf)
# New data frame
hyp_preds <- data.frame(age = newdf$age,
awakenings = newdf$awakenings,
sleep_type_percentage = newdf$sleep_type_percentage,
pred=pred)
# Add a new column showing whether a data point is real or hypothetical
df2$prediction_type <- "Real"
hyp_preds$prediction_type <- "Hypothetical"
# Join real data and hypothetical data (with model predictions)
fullpreds <- full_join(df2, hyp_preds)
Predictions plotted alongside real data:
ggplot(fullpreds, aes(x = sleep_type_percentage, y = pred, color = prediction_type)) +
geom_point(aes(y = sleep_efficiency), color = "Black") +
geom_point() +
theme_minimal()
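A plot of point predictions does not show how uncertain each prediction is. One option is to ask predict() for prediction intervals. The sketch below uses the built-in mtcars data and made-up new rows as a stand-in; the same call should work with mod3 and newdf.

```r
# Stand-in model with an interaction, fit on a built-in dataset.
fit <- lm(mpg ~ wt * hp, data = mtcars)

# Hypothetical new observations to predict for.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 200))

# interval = "prediction" returns the point estimate plus lower and upper bounds.
pred_int <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
pred_int
```

The result has three columns, fit, lwr, and upr. A wide interval for a hypothetical subject would be a sign that the point prediction should not be taken too literally.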
The model I made seems to be fairly accurate, explaining about 76% of the variation in sleep efficiency (R2 = 0.756). From this report, I can see that there are many factors that can contribute to an efficient night of sleep.
References
National Sleep Foundation. (2024). Why do we need sleep? Sleep Foundation. Retrieved from https://www.sleepfoundation.org/how-sleep-works/why-do-we-need-sleep